VIBRANT Trial - Primary Analyses

Authors

Laura Symul, Laura Vermeren, Susan Holmes

Code
library(tidyverse)
library(magrittr)
library(gt)
library(patchwork)
library(SummarizedExperiment)
library(tidySummarizedExperiment)
library(brms)
# library(mia) # BiocManager::install("mia")

theme_set(theme_light())
tmp <- fs::dir_map("R", source)
Code
data_source <- "simulated"

Introduction

This document presents the primary analyses of the VIBRANT trial.

Currently, this runs on simulated data, but the code is mostly ready to be run on the real data once it is available.

The simulated data are currently fairly realistic (though they still need some improvement) and very optimistic (they assume a very successful trial).

Until we have the actual data, we intend to improve the simulated data (i.e., make them more realistic) by:

  • Modeling the metagenomic data from an actual file provided by Jacques’ lab (currently, we “invented” the format, but it would be much better to have an actual pilot example)
  • Adding demographic variables
  • Adding dropouts (participants who left the study before the endpoint), missed visits, or missing data (failed assay or uncollected sample)
  • Adding replacements
  • Adding differences in LBP strain survival rates between sites, arms, and participants
  • Adding inter-individual variability in how long the LBP strains survive
  • Adding bacterial species (currently, we only have the 15 LBP strains, “other” L. crispatus, L. iners, and “non-Lacto” (lumped together))
  • (Adding contamination)

Consequently, some (supplementary) analyses are still missing:

  • Demographics (Table 1)
  • Sensitivity analyses for the different populations (ITT, mITT, PP)
  • … as well as anything marked “TODO” or “XXX” within the document (placeholders for analyses still to be implemented or text still to be written)

These will be implemented once the corresponding improvements listed above are in place.

Even with all these improvements, there will inevitably be unanticipated issues with the real data, but the core of the methods (and the corresponding code) will likely remain unchanged.

Workflow

The document is organized as follows:

  • Data preparation: load the raw data as provided by labs/teams, and collate them into a neat MultiAssayExperiment object containing all the data linked by participant x visit.

  • Checks, QC, and Outcomes Definitions: check the data for potential issues, perform quality controls (e.g., checks of library sizes, potential contamination, etc.), and define the outcomes of interest (colonization status, etc.)

  • Analyses: perform the primary analyses of the trial, including demographics, primary and secondary outcomes, and sensitivity analyses. This section has the main tables and figures.

Data preparation

Loading the data and creating SummarizedExperiment objects for each assay

In this section, we load the raw data from their original files (i.e., the data/files as generated by the corresponding labs or teams) and transform each assay into a SummarizedExperiment object. This will then enable us to collect them into a single MultiAssayExperiment object and perform the integrative analyses.

To link the assays together, we define a common unique identifier for each participant x visit. This unique identifier (.sample) is the concatenation of the pid (participant ID) and the visit_code.

Clinical and survey (CRF) data

Information about the data

The data acquisition process is described XXXHEREXXXX.

The data encoding in a xxx database and quality control processes are described XXXHEREXXX.

The data was exported from the database by Lara Lewis (CAPRISA, South Africa) as a set of xxx files.

Lara Wautier (UCLouvain, Belgium) then imported these files into R and assembled the data into two tables: a participants table and a visits table. The code is available XXXHEREXXX.

The participants table contains participant-invariant information such as the participant ID, the treatment arm, site, and study population (ITT, mITT, PP). The visits table contains participant-varying information collected at each visit.

The columns necessary for the primary trial analyses were then selected from these two and exported into two files: participants.RDS and visits.RDS, which we load below.

Loading the data

We load the participants and visits data, and create:

  • one table which will become the MAE’s colData
  • one SE object which provides attendance information at each visit (this also ensures that the MAE object has data for all visits, which is useful for downstream analyses)
Code
participants <- readRDS(file = str_c(data_dir(), "00 clinical data/participants.RDS"))
visits <- readRDS(file = str_c(data_dir(), "00 clinical data/visits.RDS"))

clin <- 
   participants |> left_join(visits, by = join_by(pid)) |> 
  mutate(sample_id = str_c(pid, "_", visit_code), rownames = sample_id) |> 
  select(sample_id, everything()) 

mae_coldata <- 
  clin |> 
  as.data.frame() |> 
  column_to_rownames(var = "rownames")
Code
make_se_clin <- function(clin, cols, assayname){
  assay <- 
    clin |> 
    select(sample_id, all_of(cols)) |> 
    as.data.frame() |> 
    column_to_rownames("sample_id") |> 
    t()
  
  assay_list <- list()
  assay_list[[assayname]] <- assay
  
  SummarizedExperiment::SummarizedExperiment(
    assays = assay_list
  )
  
}
Code
clin <- clin |> mutate(planned = TRUE, attended = ifelse(str_detect(visit_name, "Daily"), NA, TRUE))
Code
se_va <- make_se_clin(clin, cols = c("planned", "attended"), assayname = "visit_attendance")

Weekly metagenomics data with LBP strains relative abundances

TODO: add information about how the data was generated (summary of the methods)

We load the weekly metagenomics data, the manifest for the metagenomic barcodes, and the strain information.

Code
mg <- read_csv(str_c(data_dir(), "01 metagenomics/mg.csv"))
mg_manifest <- read_csv(str_c(data_dir(), "01 metagenomics/mg_manifest.csv"))
LBP_strain_info <- readxl::read_xlsx(str_c(dropbox_dir(), "9_VIBRANT_Study Product/IsolateNumbers.xlsx"))

LBP_strain_info <- 
 LBP_strain_info |> 
  mutate(strain_id = `VMRC ID`, 
         LBP = ifelse(is.na(LC106), "LC-106 & LC-115", "LC-106") |> factor(), 
         strain_origin = `Geographic source` |> factor()
  ) |> 
  dplyr::rename(Biose_ID = `Biose ID`, VMRC_ID = `VMRC ID`) |> 
  arrange(strain_origin, LBP) |> 
  mutate(strain_id = strain_id |> fct_inorder()) |> 
  select(strain_id, LBP, strain_origin, contains("ID"))

LBP_strain_info |> gt(caption = "LBP strain information")
LBP strain information
strain_id LBP strain_origin Biose_ID VMRC_ID
FF00018 LC-106 SA GA08 FF00018
FF00051 LC-106 SA GA09 FF00051
UC101_2_33 LC-106 SA GA12 UC101_2_33
FF00004 LC-106 & LC-115 SA GA07 FF00004
FF00064 LC-106 & LC-115 SA GA10 FF00064
FF00072 LC-106 & LC-115 SA GA11 FF00072
UC119_2_PB0238 LC-106 & LC-115 SA GA13 UC119_2_PB0238
122010_1999_16 LC-106 & LC-115 SA GA14 122010_1999_16
185329_1999_17 LC-106 & LC-115 SA GA15 185329_1999_17
C0059E1 LC-106 US GA03 C0059E1
C0022A1 LC-106 US GA04 C0022A1
C0175A1 LC-106 US GA06 C0175A1
C0006A1 LC-106 & LC-115 US GA01 C0006A1
CC0028A1 LC-106 & LC-115 US GA02 CC0028A1
C0112A1 LC-106 & LC-115 US GA05 C0112A1
Code
mg_to_SE <- function(mg, mg_manifest, LBP_strain_info){
  mg_manifest <- 
    mg_manifest |> 
    mutate(sample_id = str_c(pid, "_", visit_code))
  
  if (any(duplicated(mg_manifest$sample_id))) stop("Duplicated `sample_id` in `mg_manifest`")
  
  mg <- mg |> left_join(mg_manifest, by = join_by(barcode))
  tmp <- mg |> select(barcode, sample_id) |> distinct()
  if (any(duplicated(tmp$sample_id))) stop("Duplicated `sample_id` in `mg`")
  
  assay_prop_of_Lc <- 
    mg |> 
    select(sample_id, strain, prop_of_Lc) |> 
    pivot_wider(names_from = sample_id, values_from = prop_of_Lc) |> 
    as.data.frame() |> 
    column_to_rownames("strain") 
    
  assay_unique_kmers <- 
    mg |> 
    select(sample_id, strain, unique_kmers) |> 
    pivot_wider(names_from = sample_id, values_from = unique_kmers) |> 
    as.data.frame() |> 
    column_to_rownames("strain")
  
  se_coldata <- 
    mg |> 
    select(sample_id, barcode, shared_kmers) |> 
    distinct() |> 
    as.data.frame() 
  
  se_rowdata <- 
    mg |> 
    select(strain) |> 
    distinct() |> 
    left_join(
      LBP_strain_info,
      by = join_by(strain == strain_id)
    ) 
  
  SummarizedExperiment::SummarizedExperiment(
    assays = list(prop_of_Lc = assay_prop_of_Lc, unique_kmers = assay_unique_kmers),
    rowData = se_rowdata,
    colData = se_coldata
  )
}
Code
(se_mg <- mg_to_SE(mg, mg_manifest, LBP_strain_info))
# A SummarizedExperiment-tibble abstraction: 14,400 × 12
# Features=16 | Samples=900 | Assays=prop_of_Lc, unique_kmers
   .feature       .sample prop_of_Lc unique_kmers sample_id barcode shared_kmers
   <chr>          <chr>        <dbl>        <dbl> <chr>     <chr>          <dbl>
 1 FF00018        p1_1000          0            0 p1_1000   mg_p1_…           48
 2 FF00051        p1_1000          0            0 p1_1000   mg_p1_…           48
 3 UC101_2_33     p1_1000          0            0 p1_1000   mg_p1_…           48
 4 FF00004        p1_1000          0            0 p1_1000   mg_p1_…           48
 5 FF00064        p1_1000          0            0 p1_1000   mg_p1_…           48
 6 FF00072        p1_1000          0            0 p1_1000   mg_p1_…           48
 7 UC119_2_PB0238 p1_1000          0            0 p1_1000   mg_p1_…           48
 8 122010_1999_16 p1_1000          0            0 p1_1000   mg_p1_…           48
 9 185329_1999_17 p1_1000          0            0 p1_1000   mg_p1_…           48
10 C0059E1        p1_1000          0            0 p1_1000   mg_p1_…           48
# ℹ 310 more rows
# ℹ 5 more variables: strain <chr>, LBP <fct>, strain_origin <fct>,
#   Biose_ID <chr>, VMRC_ID <chr>

Checks & Quality Controls

Total unique kmers per sample

Code
se_mg |> 
  as_tibble() |> 
  group_by(.sample) |> summarize(total_unique_kmers = sum(unique_kmers)) |> 
  ggplot(aes(x = total_unique_kmers)) +
  geom_histogram() +
  scale_x_log10()

Data overview

Code
se_mg |> 
  as_tibble() |> 
  mutate(.feature = .feature |> fct_inorder()) |> 
  ggplot(aes(x = .feature, y = .sample, fill = prop_of_Lc)) +
  geom_tile() +
  scale_fill_continuous(low = "white", high = "steelblue2") +
  scale_x_discrete("Strains") +
  scale_y_discrete("Samples",breaks = NULL) +
  facet_grid(. ~ LBP + strain_origin, scales = "free_x", space = "free") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Weekly 16S rRNA data

TODO: add a summary of the protocols and methods used to generate the 16S data

We load the 16S rRNA data, which is stored as a phyloseq object.

Code
(w16S <- readRDS(str_c(data_dir(), "02 weekly 16S/w16S.RDS")))
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3 taxa and 1000 samples ]
sample_data() Sample Data:       [ 1000 samples by 2 sample variables ]
tax_table()   Taxonomy Table:    [ 3 taxa by 6 taxonomic ranks ]

We transform the phyloseq object into a SummarizedExperiment object.

Code
phyloseq_to_SE <- function(physeq){
  
  sample_info <- 
    physeq@sam_data |>
    as.data.frame() |> 
    set_class("data.frame") |> 
    rownames_to_column("barcode") |> 
    as_tibble() |> 
    mutate(sample_id = str_c(pid, "_", visit_code))

  counts <- physeq@otu_table |> as.data.frame() |> set_class("data.frame")
  if (!phyloseq::taxa_are_rows(physeq)) counts <- counts |> t()
  counts <- counts |> set_colnames(sample_info$sample_id)
  rel <- t(t(counts)/colSums(counts)) |> as.data.frame()
  
  SummarizedExperiment::SummarizedExperiment(
    assays = list(ASV_counts = counts, ASV_prop = rel),
    rowData = physeq@tax_table |> as.data.frame() |> set_class("data.frame"),
    colData = sample_info |> as.data.frame() |> column_to_rownames("sample_id")
  )
}
Code
(se_16S <- phyloseq_to_SE(w16S))
# A SummarizedExperiment-tibble abstraction: 3,000 × 13
# Features=3 | Samples=1000 | Assays=ASV_counts, ASV_prop
   .feature  .sample ASV_counts ASV_prop barcode pid   visit_code Kingdom Phylum
   <chr>     <chr>        <dbl>    <dbl> <chr>   <fct>      <dbl> <chr>   <chr> 
 1 L. crisp… p1_0            30    0.24  16S_p1… p1             0 Bacter… Firmi…
 2 L. iners  p1_0            36    0.288 16S_p1… p1             0 Bacter… Actin…
 3 non-Lacto p1_0            59    0.472 16S_p1… p1             0 Bacter… Actin…
 4 L. crisp… p1_1000         49    0.340 16S_p1… p1          1000 Bacter… Firmi…
 5 L. iners  p1_1000         42    0.292 16S_p1… p1          1000 Bacter… Actin…
 6 non-Lacto p1_1000         53    0.368 16S_p1… p1          1000 Bacter… Actin…
 7 L. crisp… p1_1100         45    0.344 16S_p1… p1          1100 Bacter… Firmi…
 8 L. iners  p1_1100         36    0.275 16S_p1… p1          1100 Bacter… Actin…
 9 non-Lacto p1_1100         50    0.382 16S_p1… p1          1100 Bacter… Actin…
10 L. crisp… p1_1200         47    0.362 16S_p1… p1          1200 Bacter… Firmi…
# ℹ 50 more rows
# ℹ 4 more variables: Class <chr>, Order <chr>, Family <chr>, Genus <chr>

Checks & Quality Controls

Total reads per sample

Code
se_16S |> 
  as_tibble() |> 
  group_by(.sample) |> summarize(total_counts = sum(ASV_counts)) |> 
  ggplot(aes(x = total_counts)) +
  geom_histogram() +
  scale_x_log10()

Data overview

Code
se_16S |> 
  as_tibble() |> 
  group_by(.feature) |> mutate(total_prop = sum(ASV_prop)) |> ungroup() |> 
  arrange(-total_prop) |> mutate(.feature = .feature |> fct_inorder()) |> 
  ggplot(aes(x = .feature, y = .sample, fill = asinh(ASV_counts))) +
  geom_tile() +
  scale_fill_continuous(low = "white", high = "steelblue2") +
  scale_x_discrete("ASVs", breaks = NULL) +
  scale_y_discrete("Samples",breaks = NULL) 

Integrating all assays into a MultiAssayExperiment object

In this section, we assemble all SE objects created in the previous section into a single MultiAssayExperiment object.

colData

Code
mae_coldata |> as_tibble()
# A tibble: 4,600 × 9
   sample_id pid   site  arm   visit_code visit_abbr visit_name     visit_number
   <chr>     <fct> <fct> <fct>      <dbl> <fct>      <fct>                 <dbl>
 1 p1_0      p1    CAP   Pl             0 SCR        Screening                 0
 2 p1_1000   p1    CAP   Pl          1000 ENR        Enrollment                1
 3 p1_1001   p1    CAP   Pl          1001 D-ABX      Daily ABX                NA
 4 p1_1002   p1    CAP   Pl          1002 D-ABX      Daily ABX                NA
 5 p1_1003   p1    CAP   Pl          1003 D-ABX      Daily ABX                NA
 6 p1_1004   p1    CAP   Pl          1004 D-ABX      Daily ABX                NA
 7 p1_1005   p1    CAP   Pl          1005 D-ABX      Daily ABX                NA
 8 p1_1006   p1    CAP   Pl          1006 D-ABX      Daily ABX                NA
 9 p1_1007   p1    CAP   Pl          1007 D-ABX      Daily ABX                NA
10 p1_1100   p1    CAP   Pl          1100 Post-ABX   Post antibiot…            2
# ℹ 4,590 more rows
# ℹ 1 more variable: day <dbl>

Assays

Visit attendance

Code
se_va
# A SummarizedExperiment-tibble abstraction: 9,200 × 3
# Features=2 | Samples=4600 | Assays=visit_attendance
   .feature .sample visit_attendance
   <chr>    <chr>   <lgl>           
 1 planned  p1_0    TRUE            
 2 attended p1_0    TRUE            
 3 planned  p1_1000 TRUE            
 4 attended p1_1000 TRUE            
 5 planned  p1_1001 TRUE            
 6 attended p1_1001 NA              
 7 planned  p1_1002 TRUE            
 8 attended p1_1002 NA              
 9 planned  p1_1003 TRUE            
10 attended p1_1003 NA              
# ℹ 30 more rows

Weekly metagenomics data with LBP strains relative abundances

Code
se_mg
# A SummarizedExperiment-tibble abstraction: 14,400 × 12
# Features=16 | Samples=900 | Assays=prop_of_Lc, unique_kmers
   .feature       .sample prop_of_Lc unique_kmers sample_id barcode shared_kmers
   <chr>          <chr>        <dbl>        <dbl> <chr>     <chr>          <dbl>
 1 FF00018        p1_1000          0            0 p1_1000   mg_p1_…           48
 2 FF00051        p1_1000          0            0 p1_1000   mg_p1_…           48
 3 UC101_2_33     p1_1000          0            0 p1_1000   mg_p1_…           48
 4 FF00004        p1_1000          0            0 p1_1000   mg_p1_…           48
 5 FF00064        p1_1000          0            0 p1_1000   mg_p1_…           48
 6 FF00072        p1_1000          0            0 p1_1000   mg_p1_…           48
 7 UC119_2_PB0238 p1_1000          0            0 p1_1000   mg_p1_…           48
 8 122010_1999_16 p1_1000          0            0 p1_1000   mg_p1_…           48
 9 185329_1999_17 p1_1000          0            0 p1_1000   mg_p1_…           48
10 C0059E1        p1_1000          0            0 p1_1000   mg_p1_…           48
# ℹ 310 more rows
# ℹ 5 more variables: strain <chr>, LBP <fct>, strain_origin <fct>,
#   Biose_ID <chr>, VMRC_ID <chr>

Weekly 16S rRNA data

Code
se_16S
# A SummarizedExperiment-tibble abstraction: 3,000 × 13
# Features=3 | Samples=1000 | Assays=ASV_counts, ASV_prop
   .feature  .sample ASV_counts ASV_prop barcode pid   visit_code Kingdom Phylum
   <chr>     <chr>        <dbl>    <dbl> <chr>   <fct>      <dbl> <chr>   <chr> 
 1 L. crisp… p1_0            30    0.24  16S_p1… p1             0 Bacter… Firmi…
 2 L. iners  p1_0            36    0.288 16S_p1… p1             0 Bacter… Actin…
 3 non-Lacto p1_0            59    0.472 16S_p1… p1             0 Bacter… Actin…
 4 L. crisp… p1_1000         49    0.340 16S_p1… p1          1000 Bacter… Firmi…
 5 L. iners  p1_1000         42    0.292 16S_p1… p1          1000 Bacter… Actin…
 6 non-Lacto p1_1000         53    0.368 16S_p1… p1          1000 Bacter… Actin…
 7 L. crisp… p1_1100         45    0.344 16S_p1… p1          1100 Bacter… Firmi…
 8 L. iners  p1_1100         36    0.275 16S_p1… p1          1100 Bacter… Actin…
 9 non-Lacto p1_1100         50    0.382 16S_p1… p1          1100 Bacter… Actin…
10 L. crisp… p1_1200         47    0.362 16S_p1… p1          1200 Bacter… Firmi…
# ℹ 50 more rows
# ℹ 4 more variables: Class <chr>, Order <chr>, Family <chr>, Genus <chr>

Assembling the MultiAssayExperiment object

Code
assay_list <- 
  list(
    visit_attendance = se_va,
    mg_ksanity = se_mg,
    amplicon_ASV_all = se_16S
    )

mae <- 
  MultiAssayExperiment::MultiAssayExperiment(
    experiments = MultiAssayExperiment::ExperimentList(assay_list),
    colData = mae_coldata |> dplyr::rename(.sample = sample_id),
    metadata = list(data_source = data_source, date_created = today())
  )
Code
mae
A MultiAssayExperiment object of 3 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 3:
 [1] visit_attendance: SummarizedExperiment with 2 rows and 4600 columns
 [2] mg_ksanity: SummarizedExperiment with 16 rows and 900 columns
 [3] amplicon_ASV_all: SummarizedExperiment with 3 rows and 1000 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

Checks, QC, and Outcomes Definitions

Checks and Quality Controls

Visualization of available data for each assay and each site x arm x participant x visit

Code
available_data <- 
  mae[["visit_attendance"]] |> 
  as_tibble() |> 
  filter(.feature == "attended") |> 
  dplyr::rename(attended = visit_attendance) |> 
  left_join(
    mae@sampleMap |> 
      as_tibble() |> 
      filter(assay != "visit_attendance") |> 
      mutate(.sample = primary) |> 
      select(.sample, assay) |> 
      mutate(available_data = TRUE),
    by = join_by(.sample)
  ) |> 
  left_join(
    mae@colData |> as_tibble() |> select(.sample, pid, site, arm, starts_with("visit_"))
  )

assay_collection_schedule <- 
  available_data |> 
  filter(!is.na(assay)) |> 
  group_by(assay, visit_code) |> 
  summarize(assay_scheduled_at_visit = !all(is.na(available_data)), .groups = "drop") |> 
  filter(assay_scheduled_at_visit)

available_data <- 
  available_data |> 
  left_join(
    assay_collection_schedule,
    by = join_by(assay, visit_code)
  )

available_data <- 
  available_data |> 
  filter(assay_scheduled_at_visit) 
Code
available_data |> 
  ggplot(aes(x = assay, y = pid, col = assay)) +
  geom_point() +
  facet_grid(arm + site ~ visit_code, scales = "free", space = "free") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), legend.position  = "none") 

Checking for contamination

While it is possible for a few Placebo participants to natively carry strains that are extremely similar to the LBP strains, it is highly unlikely for many of the LBP strains to be simultaneously present in a single sample from a Placebo participant. Such an event would much more likely be the result of contamination, a mislabeling of the sample, or a randomization issue.

We thus check how many LBP strains are simultaneously present in the Placebo participants’ samples.

TODO
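This check is not implemented yet; below is a minimal sketch. It assumes that the Placebo arm is coded "Pl" in the colData and that a strain counts as "detected" when `unique_kmers > 0`; both assumptions, as well as the (arbitrary) threshold of 3 co-detected strains, would need to be confirmed against the real data.

```r
# Sketch of the contamination check (assumptions: Placebo arm coded "Pl",
# a strain is "detected" when unique_kmers > 0 -- both to be confirmed).
placebo_strain_counts <- 
  mae[["mg_ksanity"]] |> 
  as_tibble() |> 
  left_join(
    colData(mae) |> as_tibble() |> select(.sample, arm),
    by = join_by(.sample)
  ) |> 
  filter(arm == "Pl") |> 
  group_by(.sample) |> 
  summarize(n_LBP_strains_detected = sum(unique_kmers > 0), .groups = "drop")

# Samples in which several LBP strains are detected at once are flagged as
# possible contamination, mislabeling, or randomization issues.
placebo_strain_counts |> filter(n_LBP_strains_detected >= 3)
```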

Other checks?

Outcomes definitions

Primary outcome definition

The primary outcome is XXX INSERT DEFINITION XXX

Consequently, we need to compute the relative abundance of each LBP strain by combining (i) its proportion of total L. crispatus (from the metagenomics assay) with (ii) the total L. crispatus proportion as estimated by 16S sequencing.

LBP strain relative abundances in each sample

This is stored in a new assay: LBP_rel_ab

Code
se_LBP_rel_abundances <- 
  list(
    LBP_rel_ab = 
      mae[["mg_ksanity"]] |> 
      filter(!is.na(LBP)) |> 
      left_join(
        mae[["amplicon_ASV_all"]] |> 
          filter(str_detect(.feature, "crispatus")) |> 
          as_tibble() |> 
          select(.sample, ASV_prop) |> 
          group_by(.sample) |> summarize(Lc_prop = sum(ASV_prop))
      ) |> 
      mutate(
        LBP_rel_ab = prop_of_Lc * Lc_prop
      ) |> 
      select(-prop_of_Lc, -unique_kmers) 
  )


mae <- c(mae, se_LBP_rel_abundances)

At each visit

From this, we can compute the total relative abundance of all LBP strains or the maximum relative abundance of any LBP strain in each sample.

If the total relative abundance of all LBP strains is larger than 0.1 or the maximum relative abundance of any LBP strain is larger than 0.05, we consider that the LBP achieved colonization at that visit.

Code
colonization <- 
  mae[["LBP_rel_ab"]] |> 
  as_tibble() |>
  left_join(colData(mae) |> as_tibble(),by = join_by(.sample)) |> 
  group_by(.sample, pid, visit_code) |> 
  summarize(
    total_prop_at = sum(LBP_rel_ab),
    max_prop_at = max(LBP_rel_ab),
    .groups = "drop"
  ) |> 
  mutate(
    colonized_at = (total_prop_at > 0.1) | max_prop_at > 0.05
  ) 

By each visit

From the colonization status at each visit, we can compute the colonization status by each visit. A participant is considered to have colonized by a visit if they have been colonized at that visit or any previous visit.

Code
colonization <- 
  colonization |> 
  group_by(pid) |> 
  arrange(pid, visit_code) |>
  mutate(
    colonized_by = cumsum(colonized_at) > 0
  ) |> 
  ungroup()

Adding primary assay to MAE

Code
se_primary <-
  SummarizedExperiment(
    assays = list(
      primary = 
        colonization |> 
        select(-pid, -visit_code) |> 
        as.data.frame() |> 
        column_to_rownames(".sample") |> t()
    )
  )

mae <- c(mae, list(LBP_colonization = se_primary))

Secondary outcomes definitions

16S rRNA microbiota composition, aggregated at the species level

TODO
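This aggregation is not implemented yet; a minimal sketch is below. It assumes that the `.feature` names of `amplicon_ASV_all` already carry the species label (true for the simulated data); on the real data, aggregating over a Species column of the rowData (e.g., with `mia::agglomerateByRank()`, commented out in the setup) would be the more idiomatic route.

```r
# Sketch: aggregate ASV counts and proportions at the species level.
# Assumes .feature holds the species label (true for the simulated data);
# with real data, aggregation should use a Species rank in rowData.
species_level <- 
  mae[["amplicon_ASV_all"]] |> 
  as_tibble() |> 
  group_by(.sample, species = .feature) |> 
  summarize(
    counts = sum(ASV_counts),
    prop = sum(ASV_prop),
    .groups = "drop"
  )
```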

16S rRNA microbiota composition, augmented with LBP strains relative abundances

We combine the 16S rRNA microbiota composition with the LBP strains relative abundances to create a new assay mb_composition.

TODO
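Not implemented yet; below is a minimal sketch of how the `mb_composition` table could be assembled. It assumes that `LBP_rel_ab` and `ASV_prop` are on the same relative-abundance scale, and defines the residual “other” L. crispatus fraction as the 16S L. crispatus proportion minus the summed LBP strain abundances (floored at zero).

```r
# Sketch of the augmented composition: LBP strains + non-crispatus taxa +
# residual "other" L. crispatus (scales assumed comparable -- to be confirmed).
lbp_long <- 
  mae[["LBP_rel_ab"]] |> 
  as_tibble() |> 
  select(.feature, .sample, rel_ab = LBP_rel_ab)

non_crispatus <- 
  mae[["amplicon_ASV_all"]] |> 
  as_tibble() |> 
  filter(!str_detect(.feature, "crispatus")) |> 
  select(.feature, .sample, rel_ab = ASV_prop)

other_crispatus <- 
  mae[["amplicon_ASV_all"]] |> 
  as_tibble() |> 
  filter(str_detect(.feature, "crispatus")) |> 
  select(.sample, Lc_prop = ASV_prop) |> 
  left_join(
    lbp_long |> group_by(.sample) |> summarize(LBP_total = sum(rel_ab)),
    by = join_by(.sample)
  ) |> 
  mutate(
    .feature = "other L. crispatus",
    rel_ab = pmax(Lc_prop - replace_na(LBP_total, 0), 0)
  ) |> 
  select(.feature, .sample, rel_ab)

mb_composition <- bind_rows(lbp_long, non_crispatus, other_crispatus)
```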

We now have an “augmented” MAE object with the following assays:

Code
mae
A MultiAssayExperiment object of 5 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 5:
 [1] visit_attendance: SummarizedExperiment with 2 rows and 4600 columns
 [2] mg_ksanity: SummarizedExperiment with 16 rows and 900 columns
 [3] amplicon_ASV_all: SummarizedExperiment with 3 rows and 1000 columns
 [4] LBP_rel_ab: SummarizedExperiment with 15 rows and 900 columns
 [5] LBP_colonization: SummarizedExperiment with 4 rows and 900 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

Analyses

Demographics (Table 1)

TODO
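Not implemented yet (the simulated data lack demographic variables); a minimal sketch using the only participant-level variables available so far:

```r
# Sketch of Table 1: participant counts per site and arm. Once demographic
# variables (e.g., age) are added to `participants`, summaries of those
# variables would be added as further rows.
participants |> 
  count(site, arm) |> 
  pivot_wider(names_from = arm, values_from = n) |> 
  gt(caption = "Table 1 (sketch): number of participants per site and arm")
```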

Primary outcomes (Table 2)

Colonization by week 5

Code
col_by_week5 <- 
  colonization |> 
  left_join(colData(mae) |> as_tibble()) |> 
  filter(visit_code == 1500)

Visualization of the primary outcome

Code
col_by_week5 |> 
  arrange(site, arm, colonized_by) |>
  group_by(site, arm) |> 
  mutate(participant_number = row_number() |> factor()) |> 
  ungroup() |> 
  ggplot(aes(x = participant_number, y = site, col = site, fill = site, shape = colonized_by)) +
  geom_point(size = 3) +
  facet_grid(. ~ arm, scales = "free_x") +
  guides(color = "none", fill = "none") +
  scale_shape_manual("Colonized by LBP by week 5", values = c(1, 16)) +
  xlab("Participant number") + ylab("")

Code
summary_table <- 
  col_by_week5 |> 
  group_by(site, arm) |> 
  summarize(
    n = n(), 
    n_success = sum(colonized_by),
    .groups = "drop"
  ) |> 
  mutate(
    p = n_success / n,
    CI = binom::binom.confint(n_success, n, method = "exact")
  )
Code
summary_table |> 
  mutate(
    N_per_site_and_arm = str_c("N = ", n),
    LBP_strain_detected = str_c(n_success, " (", round(p * 100)," %)"),
    CI_text = str_c(round(CI$lower * 100), "% - ", round(CI$upper * 100), "%")
  ) |> 
  select(
    site, arm, N_per_site_and_arm, LBP_strain_detected, CI_text
  ) |> 
  pivot_longer(-c(site, arm)) |>
  mutate(
    name = 
      case_when(
        name == "N_per_site_and_arm" ~ "N participants",
        name == "LBP_strain_detected" ~ "n (%) of participants\nwith LBP strain detected",
        name == "CI_text" ~ "95% CI"
      )
  ) |> 
  pivot_wider(id_cols = c(site, name), names_from = arm, values_from = value) |> 
  group_by(site) |> 
  gt(caption = "Table 1: Colonization by week 5 by arm and site") |> 
  cols_width(
    "name" ~ px(200),
    everything() ~ px(120)
  ) |> 
  cols_label(
    name = "",
    Pl = "Placebo",
    `LC-106-7` = "LC-106<br>7 days",
    `LC-106-3` = "LC-106<br>3 days",
    `LC-106-o` = "LC-106<br>overlap",
    `LC-115` = "LC-115<br>7 days",
    .fn = md
  )
Table 1: Colonization by week 5 by arm and site
Placebo LC-106
7 days
LC-106
3 days
LC-106
overlap
LC-115
7 days
CAP
N participants N = 10 N = 10 N = 10 N = 10 N = 10
n (%) of participants with LBP strain detected 0 (0 %) 7 (70 %) 5 (50 %) 9 (90 %) 8 (80 %)
95% CI 0% - 31% 35% - 93% 19% - 81% 55% - 100% 44% - 97%
MGH
N participants N = 10 N = 10 N = 10 N = 10 N = 10
n (%) of participants with LBP strain detected 0 (0 %) 9 (90 %) 3 (30 %) 7 (70 %) 8 (80 %)
95% CI 0% - 31% 55% - 100% 7% - 65% 35% - 93% 44% - 97%

TODO: include the p-values from model/test below

Tests / models

If we have 0 successes in the Placebo arm, we won’t be able to use a standard logistic regression model that accounts for site differences: we have a “perfect separation” issue, and the estimated odds ratios relative to the Placebo arm are infinite:

Code
fit <- glm(colonized_by ~ arm + site + arm:site, data = col_by_week5, family = binomial)
fit |> summary()

Call:
glm(formula = colonized_by ~ arm + site + arm:site, family = binomial, 
    data = col_by_week5)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)         -1.857e+01  2.063e+03  -0.009    0.993
armLC-106-7          1.941e+01  2.063e+03   0.009    0.992
armLC-106-3          1.857e+01  2.063e+03   0.009    0.993
armLC-106-o          2.076e+01  2.063e+03   0.010    0.992
armLC-115            1.995e+01  2.063e+03   0.010    0.992
siteMGH              9.849e-11  2.917e+03   0.000    1.000
armLC-106-7:siteMGH  1.350e+00  2.917e+03   0.000    1.000
armLC-106-3:siteMGH -8.473e-01  2.917e+03   0.000    1.000
armLC-106-o:siteMGH -1.350e+00  2.917e+03   0.000    1.000
armLC-115:siteMGH   -9.848e-11  2.917e+03   0.000    1.000

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 137.186  on 99  degrees of freedom
Residual deviance:  83.534  on 90  degrees of freedom
AIC: 103.53

Number of Fisher Scoring iterations: 17

There are a few ways to remedy that:

  • One way is to use a “pseudo-Bayesian” approach and add 1 artificial success to each arm × site combination, but this introduces a small conservative bias in the odds ratio estimates (and theoretically increases the p-values).
Code
tmp <- 
  col_by_week5 |> 
  bind_rows(
    col_by_week5 |> 
      select(arm, site) |> 
      distinct() |> 
      mutate(colonized_by = TRUE)
  )

fit <- glm(colonized_by ~ arm + site + arm:site, data = tmp, family = binomial)
fit |> summary()

Call:
glm(formula = colonized_by ~ arm + site + arm:site, family = binomial, 
    data = tmp)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)   
(Intercept)         -2.303e+00  1.049e+00  -2.196  0.02810 * 
armLC-106-7          3.283e+00  1.248e+00   2.631  0.00852 **
armLC-106-3          2.485e+00  1.211e+00   2.052  0.04015 * 
armLC-106-o          4.605e+00  1.483e+00   3.105  0.00190 **
armLC-115            3.807e+00  1.308e+00   2.910  0.00361 **
siteMGH             -7.406e-16  1.483e+00   0.000  1.00000   
armLC-106-7:siteMGH  1.322e+00  1.938e+00   0.682  0.49529   
armLC-106-3:siteMGH -7.419e-01  1.720e+00  -0.431  0.66622   
armLC-106-o:siteMGH -1.322e+00  1.938e+00  -0.682  0.49529   
armLC-115:siteMGH    5.634e-16  1.850e+00   0.000  1.00000   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 148.06  on 109  degrees of freedom
Residual deviance: 103.03  on 100  degrees of freedom
AIC: 123.03

Number of Fisher Scoring iterations: 4
  • Another way is to use an actual Bayesian model that accounts for site differences, but this does not provide p-values, only posterior distributions (and credible intervals).
Code
col_by_week5_agg <- 
  col_by_week5 |>
  group_by(site, arm) |>
  summarise(success = colonized_by |> sum(), ntrials = n())

bfit <- 
  brm(
    success | trials(ntrials) ~ arm + site + arm:site, 
    data = col_by_week5_agg, 
    family = binomial("logit")
    )

(Stan sampling output condensed: 4 chains, each run for 2000 iterations (1000 warmup + 1000 sampling), elapsed time ~0.4-0.45 seconds per chain.)
Code
summary(bfit)
 Family: binomial 
  Links: mu = logit 
Formula: success | trials(ntrials) ~ arm + site + arm:site 
   Data: col_by_week5_agg (Number of observations: 10) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept             -33.60     37.34  -139.63    -2.93 1.03      180      194
armLCM106M7            34.56     37.35     3.76   140.72 1.03      181      191
armLCM106M3            33.60     37.35     2.90   139.74 1.03      179      192
armLCM106Mo            36.39     37.30     5.51   142.30 1.03      183      190
armLCM115              35.22     37.34     4.33   140.52 1.03      181      194
siteMGH                -2.70     49.51  -118.58   101.28 1.02      160      152
armLCM106M7:siteMGH     4.53     49.53  -100.43   118.93 1.02      160      154
armLCM106M3:siteMGH     1.77     49.54  -102.76   117.69 1.02      160      151
armLCM106Mo:siteMGH     0.89     49.54  -103.39   117.47 1.02      160      153
armLCM115:siteMGH       2.70     49.50  -101.55   118.59 1.02      161      155

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
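Since the Bayesian model reports posteriors rather than p-values, directed hypotheses can be summarized with `brms::hypothesis()`, which returns, for each stated inequality, the posterior probability and evidence ratio. A sketch, using the coefficient names printed by `summary(bfit)` above:

Code
# Posterior support for each active arm having higher colonization
# odds than placebo at the reference site
hypothesis(
  bfit,
  c(
    "armLCM106M7 > 0",
    "armLCM106M3 > 0",
    "armLCM106Mo > 0",
    "armLCM115 > 0"
  )
)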
Code
plot(bfit)

Code
bind_cols(col_by_week5_agg, predict(bfit))
# A tibble: 10 × 8
# Groups:   site [2]
   site  arm      success ntrials Estimate Est.Error  Q2.5 Q97.5
   <fct> <fct>      <int>   <int>    <dbl>     <dbl> <dbl> <dbl>
 1 CAP   Pl             0      10   0.0418     0.268     0     1
 2 CAP   LC-106-7       7      10   7.02       1.95      3    10
 3 CAP   LC-106-3       5      10   5.02       2.15      1     9
 4 CAP   LC-106-o       9      10   9.06       1.26      6    10
 5 CAP   LC-115         8      10   8.01       1.72      4    10
 6 MGH   Pl             0      10   0.0445     0.294     0     1
 7 MGH   LC-106-7       9      10   9.05       1.25      6    10
 8 MGH   LC-106-3       3      10   2.99       1.93      0     7
 9 MGH   LC-106-o       7      10   7.05       1.95      3    10
10 MGH   LC-115         8      10   7.99       1.71      4    10
Code
plot(conditional_effects(bfit), ask = FALSE)

Code
# loo(bfit, moment_match = TRUE, reloo = TRUE)
# pp_check(bfit)
  • Alternatively, we can use exact Fisher tests (as initially proposed by Lara Lewis, CAPRISA) for each arm × site combination, correcting for multiple testing, but this does not allow us to directly test for site differences.
Code
fisher_test_res <- tibble()

for (selected_site in levels(col_by_week5$site)) {
  for (selected_arm in levels(col_by_week5$arm)[-1]) {
    ft <- 
      col_by_week5 |> 
      filter(site == selected_site, arm %in% c("Pl", selected_arm)) |> 
      mutate(arm = arm |> fct_drop()) |> 
      select(colonized_by, arm) |> 
      table() |> 
      fisher.test()
    fisher_test_res <- 
      fisher_test_res |> 
      bind_rows(
        tibble(
          site = selected_site,
          arm = selected_arm,
          odds = ft$estimate,
          pvalue = ft$p.value
        )
      )
  }
}

fisher_test_res <- 
  fisher_test_res |> 
  mutate(
    qval = p.adjust(pvalue, method = "BH")
  )

fisher_test_res |> 
  group_by(site) |> 
  gt(caption = "Fisher exact test results")
Fisher exact test results

site  arm       odds  pvalue        qval
CAP   LC-106-7  Inf   0.0030959752  0.0041279670
CAP   LC-106-3  Inf   0.0325077399  0.0371517028
CAP   LC-106-o  Inf   0.0001190760  0.0004763039
CAP   LC-115    Inf   0.0007144558  0.0014289116
MGH   LC-106-7  Inf   0.0001190760  0.0004763039
MGH   LC-106-3  Inf   0.2105263158  0.2105263158
MGH   LC-106-o  Inf   0.0030959752  0.0041279670
MGH   LC-115    Inf   0.0007144558  0.0014289116
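A related option that pools the per-site 2 × 2 tables, stratifying by site rather than testing for it, is a Cochran-Mantel-Haenszel test via `stats::mantelhaen.test()`. A sketch for one arm vs placebo (the arm label is taken from the table above):

Code
# CMH test of colonization ~ arm, stratified by site,
# for the LC-115 vs placebo comparison
col_by_week5 |>
  filter(arm %in% c("Pl", "LC-115")) |>
  mutate(arm = fct_drop(arm)) |>
  select(colonized_by, arm, site) |>
  table() |>
  mantelhaen.test()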

Secondary outcomes (Figures 2-4)

Kinetics of colonization

Total proportions of LBP isolates

Code
g_tot_prop_LBP_strains <- 
  colonization |> 
  as_tibble() |> 
  left_join(mae |> colData() |> as_tibble()) |>
  filter(visit_abbr != "SCR") |> 
  mutate(visit_week = visit_number - 1) |> 
  ggplot(aes(x = visit_week, y = total_prop_at, col = arm)) +
  geom_line(aes(group = pid), alpha = 0.5) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_grid(site ~ arm) + 
  scale_x_continuous("Study weeks", breaks = c(0:5,7,9,12), minor_breaks = 0:12) +
  scale_y_continuous(
    "Total proportion of LBP strains", labels = scales::percent,
    breaks = seq(0, 1, by = 0.2)
  ) +
  expand_limits(y = 0) +
  scale_color_manual(values = get_arm_colors()) +
  guides(col = "none") +
  ggtitle(make_title(mae)) 
Code
g_tot_prop_LBP_strains

Proportions of participants who colonized

Code
g_prop_colonized_by <- 
  colonization |> 
  as_tibble() |> 
  left_join(mae |> colData() |> as_tibble()) |>
  filter(visit_abbr != "SCR") |> 
  mutate(visit_week = visit_number - 1) |> 
  group_by(site, arm, visit_week) |>
  summarize(
    prop_who_colonized_at = mean(colonized_at), 
    # add confidence interval
    .groups = "drop"
  ) |>
  ggplot(aes(x = visit_week, y = prop_who_colonized_at, col = arm)) +
  geom_line() +
  geom_point() +
  facet_grid(site ~ .) + 
  scale_x_continuous("Study weeks", breaks = c(0:5,7,9,12), minor_breaks = 0:12) +
  scale_y_continuous(
    "Proportion of participants\nwith detected LBP strains", labels = scales::percent,
    breaks = seq(0, 1, by = 0.2)
  ) +
  expand_limits(y = 0) +
  scale_color_manual(values = get_arm_colors()) +
  ggtitle(make_title(mae)) 

Code
g_prop_colonized_by

Figure 2

Code
 g_prop_colonized_by + 
  g_tot_prop_LBP_strains +
  plot_layout(guides = "collect", widths = c(1, 5)) +
  plot_annotation(tag_levels = "a") &
  theme(legend.position = "bottom")

Microbiota trajectories

TODO

Code
# simplified_compositions <- 
#   mae[["ASV_all"]] |> 
#   as_tibble() |> 
#   mutate(
#     taxa = 
#       case_when(
#         str_detect(.feature, "L.crispatus[0-9]") ~ "L. crispatus (LBP strains)",
#         str_detect(.feature, "crispatus") ~ "L. crispatus (non-LBP strains)",
#         str_detect(.feature, "iners") ~ "L. iners",
#         str_detect(.feature, "Lactobacillus") ~ "other Lactobacillus spp.",
#         TRUE ~ "Non Lactobacillus spp."
#       ) |> factor()
#   ) |> 
#   group_by(taxa, .sample) |> 
#   summarize(prop = sum(ASV_prop), .groups = "drop") |> 
#   left_join(mae |> colData() |> as_tibble(), by = join_by(.sample == sample_id)) 
Code
# 
# g_trajectories <- 
#   simplified_compositions |> 
#   arrange_participants(by = taxa) |> 
#   filter(visit_code != 0) |> 
#   ggplot(aes(x = prop, y = pid, fill = taxa)) +
#   geom_col() +
#   facet_grid(site + arm ~ visit_code, scales = "free", space = "free") +
#   theme(panel.spacing.x = unit(0, "points")) +
#   scale_fill_manual(values = simplified_compositions$taxa |> levels() |> get_taxa_color()) +
#   scale_x_continuous(
#     "relative abundance", labels = scales::percent, 
#     breaks = seq(0, 0.5, by = 0.5), minor_breaks = seq(0, 1, by = 0.25)
#   ) +
#   scale_y_discrete("Participants", breaks = NULL) +
#   theme(
#     strip.text.y = element_text(angle = 0, hjust = 0),
#     strip.background = element_rect(color = "white")
#   )
Code
# g_trajectories

Individual LBP strains by sites

Code
LBP_strains <- 
  mae[["LBP_rel_ab"]] |> 
  as_tibble() |> 
  left_join(mae |> colData() |> as_tibble())
Code
LBP_strains_summary <- 
  LBP_strains |> 
  group_by(pid, site, arm, .feature) |> 
  summarize(max_prop = max(LBP_rel_ab), .groups = "drop")


LBP_strains_summary <- 
  LBP_strains_summary |> 
  left_join(
    mae[["mg_ksanity"]] |> 
      as_tibble() |> 
      select(.feature, LBP, strain_origin) |> 
      distinct(),
    by = join_by(.feature)
  )
Code
LBP_strains_summary |> 
  filter(arm != "Pl") |>
  mutate(strain_origin = str_c("origin: ", strain_origin)) |>
  ggplot(aes(x = site, y = max_prop, col = site, fill = site)) +
  ggbeeswarm::geom_quasirandom(alpha = 0.7) +
  geom_boxplot(alpha = 0.3, outlier.shape = NA) +
  facet_grid(. ~ strain_origin + .feature) +
  scale_color_manual(values = get_site_colors()) +
  scale_fill_manual(values = get_site_colors()) +
  scale_y_continuous(
    str_c(
      "Max. proportion of LBP strain\n",
      "in active arms participants\n",
      "at any weekly visit"),
    labels = scales::percent, breaks = seq(0, 1, by = 0.05)
    ) + expand_limits(y = 0) +
  scale_x_discrete("", breaks = NULL) +
  theme(
    # strip.text.x = element_text(angle = 45),
    legend.position = "bottom"
  )

Code
LBP_strains_summary |> 
  filter(arm != "Pl") |> 
  group_by(site, .feature, strain_origin) |> 
  summarize(prop_detected = mean(max_prop >= 0.05), .groups = "drop") |> 
  mutate(origin = str_c("strain origin: ", strain_origin)) |> 
  ggplot(aes(x = .feature, y = prop_detected, fill = site)) +
  geom_col(position = "dodge", width = 0.7) +
  facet_grid(. ~ origin, scales = "free", space = "free") +
  xlab("LBP strain") +
  scale_y_continuous(
    str_c(
      "Proportion of active arms participants\n",
      "with detected LBP strain\n",
      "(rel. abundance ≥ 5% at any weekly visit)"),
    labels = scales::percent, breaks = seq(0, 1, by = 0.2)
    ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = get_site_colors())